A Comparison of Feature-Based and Neural Scansion of Poetry
Automatic analysis of poetic rhythm is a challenging task that involves
linguistics, literature, and computer science. When the language to be analyzed
is known, rule-based systems or data-driven methods can be used. In this paper,
we analyze poetic rhythm in English and Spanish. We show that the
representations of data learned from character-based neural models are more
informative than the ones from hand-crafted features, and that a
Bi-LSTM+CRF model produces state-of-the-art accuracy on scansion of poetry in
two languages. Results also show that the information about whole word
structure, and not just independent syllables, is highly informative for
performing scansion. Comment: RANLP 201
Semantikan oinarritutako bilaketak: Kyoto proiektua
Semantic-based research: Kyoto Project.
In the digital management of documentation, the use of the text itself can be very interesting, in addition to the descriptors. Many descriptors are also text. The use of linguistic engineering techniques opens up new options for accessing information from these databases: multilingual access, semantic grouping, access based on similarity, question-answer systems, information inference, etc. This paper looks in more detail at the possibilities based on semantics, setting out the research areas being developed by the authors as part of the European Kyoto project
Strategies to develop Language Technologies for Less-Resourced Languages based on the case of Basque
Over 23 years, the IXA group has developed a basic set of resources, tools and applications for Basque, following an initial strategy that has been adapted in response to technological changes. We think that our strategy and experience can serve as a reference for other less-resourced languages. According to a six-level classification of the world's languages, we estimate that this strategy may be useful for several hundred languages: those that have developed a written standard but are still beginners in Human Language Technology.
A Methodology to Measure the Diachronic Language Distance between Three Languages Based on Perplexity
This is an Accepted Manuscript of an article published by Taylor & Francis in Journal of Quantitative Linguistics on 01 Mar 2020, available online: http://www.tandfonline.com/10.1080/09296174.2020.1732177 The aim of this paper is to apply a corpus-based methodology, based on the measure of perplexity, to automatically calculate the cross-lingual language distance between historical periods of three languages. The three historical corpora have been constructed and collected with the closest spelling to the original on a balanced basis of fiction and non-fiction. This methodology has been applied to measure the historical distance of Galician with respect to Portuguese and Spanish, from the Middle Ages to the end of the 20th century, both in original spelling and automatically transcribed spelling. The quantitative results are contrasted with hypotheses extracted from experts in historical linguistics. Results show that Galician and Portuguese are varieties of the same language in the Middle Ages and that Galician converges and diverges with Portuguese and Spanish since the last period of the 19th century. In this process, orthography plays a relevant role. It should be pointed out that the method is unsupervised and can be applied to other languages. This work has received financial support from DOMINO project [PGC2018-102041-B-I00, MCIU/AEI/FEDER, UE]; eRisk project [RTI2018-093336-B-C21]; the Consellería de Cultura, Educación e Ordenación Universitaria (accreditation 2016-2019, ED431G/08, Consolidation and structuring of Groups with Growth Potential: 745ED431B 2017/39) and the European Regional Development Fund (ERDF).
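As a rough illustration of how perplexity can act as a distance between two text samples, the sketch below trains a character n-gram model on one text and scores another. It is a minimal sketch under stated assumptions, not the authors' implementation: the n-gram order, the add-one smoothing, the symmetric averaging in `language_distance`, and all function names are choices made here for illustration, and the sample strings are toy placeholders rather than corpus data.

```python
import math
from collections import Counter


def train_char_ngrams(text, n=2):
    """Count character n-grams and their (n-1)-character contexts."""
    padded = " " * (n - 1) + text
    ngrams = Counter(padded[i:i + n] for i in range(len(text)))
    contexts = Counter(padded[i:i + n - 1] for i in range(len(text)))
    return ngrams, contexts, set(text)


def perplexity(model, text, n=2):
    """Perplexity of `text` under an add-one smoothed character model."""
    if not text:
        raise ValueError("text must be non-empty")
    ngrams, contexts, vocab = model
    v = len(vocab) + 1  # one extra slot shared by all unseen characters
    padded = " " * (n - 1) + text
    log_prob = 0.0
    for i in range(len(text)):
        gram = padded[i:i + n]
        p = (ngrams[gram] + 1) / (contexts[gram[:-1]] + v)
        log_prob += math.log(p)
    return math.exp(-log_prob / len(text))


def language_distance(train_a, test_a, train_b, test_b, n=2):
    """Assumed symmetric distance: mean of the two cross-perplexities."""
    pp_ab = perplexity(train_char_ngrams(train_a, n), test_b, n)
    pp_ba = perplexity(train_char_ngrams(train_b, n), test_a, n)
    return (pp_ab + pp_ba) / 2


# Toy placeholder strings, not real corpus samples.
model = train_char_ngrams("the cat sat on the mat the cat sat")
pp_same = perplexity(model, "the cat sat on the mat")
pp_other = perplexity(model, "zzzz qqqq xxxx")
```

Text that resembles the training sample gets a lower perplexity than unrelated text, which is the property a perplexity-based distance relies on. Because only raw text is needed, the approach requires no annotated data.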
A spelling corrector for Basque based on morphology
This paper describes the components used in the elaboration of the commercial Xuxen spelling checker/corrector for Basque. Because Basque is a highly inflected and agglutinative language, the spelling checker/corrector has been conceived as a by-product of a general-purpose morphological analyser/generator. The spelling checker/corrector performs morphological decomposition in order to check misspellings and, to correct them, uses a new strategy which combines the use of an additional two-level morphological subsystem for orthographic errors, and the recognition of correct morphemes inside the word-form during the generation of proposals for typographical errors. Due to the late standardization of Basque, Xuxen is intended as a useful tool for the standardization of present-day written Basque.
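The idea of checking a word by morphological decomposition, and of keeping only those correction proposals that decompose correctly, can be sketched as follows. This is a simplified stand-in and not Xuxen's method: it replaces the two-level morphological subsystem with plain edit-distance-1 candidate generation, and the tiny lexicon in `STEMS` and `SUFFIXES`, like every function name, is a hypothetical toy chosen for illustration.

```python
import string

# Hypothetical toy lexicon, for illustration only.
STEMS = {"etxe", "mendi"}
SUFFIXES = {"", "a", "ak", "an", "tik"}


def analyse(word):
    """Return every (stem, suffix) split licensed by the toy lexicon."""
    return [
        (word[:i], word[i:])
        for i in range(1, len(word) + 1)
        if word[:i] in STEMS and word[i:] in SUFFIXES
    ]


def is_correct(word):
    """A word is accepted if it has at least one morphological analysis."""
    return bool(analyse(word))


def edits1(word):
    """All strings one deletion, transposition, replacement or insertion away."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def proposals(word):
    """Suggest corrections: nearby strings that the analyser accepts."""
    if is_correct(word):
        return []
    return sorted(w for w in edits1(word) if is_correct(w))


suggestions = proposals("etxaen")
```

Filtering candidates through the analyser, rather than through a full-form word list, is what makes this style of checker workable for an agglutinative language, where listing every inflected form is impractical.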
Teknologia garatzeko estrategiak baliabide urriko hizkuntzetarako: euskararen eta Ixa taldearen adibidea
Strategies for developing technology for less-resourced languages: the example of Basque and the Ixa group.
The article begins by presenting several facts that show the situation of the Basque language, and then proposes a classification of the world's languages according to their presence on the Internet and in language technology. The body of the article presents the work done by the Ixa group in the field of automatic processing of Basque, identifying its seven main milestones and describing the strategy that has guided this development. It argues that this strategy can serve as a reference for 190 languages that, according to the proposed classification, have no language technology resources but do have a minimal significant presence on the Internet.
TweetMT : a parallel microblog corpus
We introduce TweetMT, a parallel corpus of tweets in four language pairs that combine five languages (Spanish from/to Basque, Catalan, Galician and Portuguese), all of which have an official status in the Iberian Peninsula. The corpus has been created by combining automatic collection and crowdsourcing approaches, and it is publicly available. It is intended for the development and testing of microtext machine translation systems. In this paper we describe the methodology followed to build the corpus, and present the results of the shared task in which it was tested
Introducción a la tarea compartida Tweet-Norm 2013: Normalización léxica de tuits en español
Introduction to the Tweet-Norm 2013 shared task: lexical normalization of tweets in Spanish.
This article presents an introduction to the Tweet-Norm 2013 task: description, corpora, annotation, preprocessing, submitted systems and results obtained. Postprint (published version)
Massively multilingual accessible audioguides via cell phones
Bidaide is a web service that allows visitors to a museum, route or building to read or listen to explanations about the place they are visiting on their own mobile phone and in their own language. The visitor can access the explanations in various ways: by scanning QR codes located at the site, by GPS positioning (on outdoor routes), or by automatic Bluetooth proximity activation. This makes it accessible for people with low or no vision. The platform also offers the manager of the visited site the most advanced language resources to create the texts and audio of the explanations in many languages.